[DNM] kv: prototype async Raft log writes #87050
Conversation
See cockroachdb#17500.

This commit polishes and flushes a prototype I've had lying around that demonstrates the async Raft log appends component of cockroachdb#17500. I'm not actively planning to productionize this, but it sounds like we may work on this project in v23.1, so this prototype might help. It also demonstrates the kind of performance wins we can expect to see on write-heavy workloads. To this point, we had only [demonstrated the potential speedup](cockroachdb#17500 (comment)) in a simulated environment with [rafttoy](https://github.com/nvanbenschoten/rafttoy).

Half of the change here is to `etcd/raft` itself, which needs to be adapted to support asynchronous log writes. These changes are presented in nvanbenschoten/etcd@1d1fa32. The other half of the change is extracting a Raft log writer component that handles the process of asynchronously appending to a collection of Raft logs and notifying individual replicas about the eventual durability of these writes. This component is pretty basic and should probably be entirely rewritten, but it gets the job done for the prototype.

The Raft log writer reveals an interesting dynamic where concurrency at this level actually hurts performance, because it leads to concurrent calls to sync Pebble's WAL. That is less performant than having a single caller, since Pebble only exposes a synchronous Sync API and coalesces all Sync requests onto a single thread. An async Pebble Sync API would be valuable here. See the comment in NewWriter for more details.

### Benchmarks

```
name                          old ops/s   new ops/s   delta
kv0/enc=false/nodes=3/cpu=32  36.4k ± 5%  46.5k ± 5%  +27.64%  (p=0.000 n=10+10)

name                          old avg(ms)  new avg(ms)  delta
kv0/enc=false/nodes=3/cpu=32  5.26 ± 3%    4.14 ± 6%    -21.33%  (p=0.000 n=8+10)

name                          old p99(ms)  new p99(ms)  delta
kv0/enc=false/nodes=3/cpu=32  10.9 ± 8%    9.1 ±10%     -15.94%  (p=0.000 n=10+10)
```

These are compelling results. I haven't pushed on this enough to know whether there's actually a throughput win here, or whether the fixed concurrency and reduced average latency is just making it look like there is. `kv0bench` should help answer that question.

cc @erikgrinaker
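To make the shape of that writer component concrete, here is a minimal, illustrative Go sketch of the idea: a single goroutine that batches appends, issues one synchronous WAL sync, and then fires per-replica durability callbacks. The names (`LogWriter`, `AppendReq`, `WAL`) are hypothetical; this is not the `raftlog/writer.go` from this PR.

```go
// A minimal, illustrative sketch of an async Raft log writer, NOT the
// writer.go from this PR. Types and names are hypothetical.
package raftlogsketch

import "sync"

// WAL abstracts the storage engine: Append stages a log write, Sync makes all
// staged writes durable. Pebble would sit behind an interface like this.
type WAL interface {
	Append(rangeID int64, entries []string) error
	Sync() error
}

// AppendReq asks the writer to append entries for one range and to invoke
// Done once those entries are durably synced (e.g. to step the raft group).
type AppendReq struct {
	RangeID int64
	Entries []string
	Done    func()
}

// LogWriter drains append requests on a single goroutine. Keeping a single
// syncer avoids concurrent WAL syncs, which the description above notes are
// slower than a single caller because Pebble coalesces syncs onto one thread.
type LogWriter struct {
	wal WAL
	ch  chan AppendReq
	wg  sync.WaitGroup
}

func NewLogWriter(wal WAL) *LogWriter {
	w := &LogWriter{wal: wal, ch: make(chan AppendReq, 256)}
	w.wg.Add(1)
	go w.run()
	return w
}

// Append enqueues a request; the caller does not block on durability.
func (w *LogWriter) Append(req AppendReq) { w.ch <- req }

func (w *LogWriter) run() {
	defer w.wg.Done()
	for req := range w.ch {
		batch := []AppendReq{req}
		// Drain whatever else is queued so one sync covers many ranges.
		for more := true; more; {
			select {
			case r, ok := <-w.ch:
				if !ok {
					more = false
				} else {
					batch = append(batch, r)
				}
			default:
				more = false
			}
		}
		for _, r := range batch {
			_ = w.wal.Append(r.RangeID, r.Entries) // error handling elided
		}
		_ = w.wal.Sync() // a single synchronous sync for the whole batch
		for _, r := range batch {
			r.Done() // notify each replica that its entries are durable
		}
	}
}

func (w *LogWriter) Close() { close(w.ch); w.wg.Wait() }
```

The single-consumer loop is what keeps the sync path serialized, which is the dynamic the description above calls out.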
Thanks @nvanbenschoten! These results are certainly compelling, and even if we only ended up with half of this on real workloads that'd still be fantastic. The relatively limited scope of the changes here is also reassuring.
Unfortunately, this didn't make the cut, and may have to wait until 23.2. We're doing the Raft log separation this cycle, but that should dovetail nicely with this work.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/kv/kvserver/raftlog/writer.go
line 73 at r1 (raw file):
// some goroutine wait on a sync to be notified of durability. An asynchronous
// variant of Pebble's API (similar to Pebble's own syncQueue) that is hooked
// directly into Pebble's LogWriter.flushLoop would help.
I reread our slack thread from May 19, which touched on this.
Our current synchronous scheme incurs a mean of 1.5x the latency of a single LogWriter flush+sync, since the LogWriter is working on something and must finish it before flushing again. And we ignore the latency incurred in waiting for a raftScheduler.worker because we assume that there are enough workers that someone will be free. This assumption partly relies on the fact that there is no deterministic assignment of replicas=>workers, so we don't have to worry about getting a hotspot because of a poor assignment. If we mirror that approach in these writers, i.e., no deterministic assignment, and enough writers, we can continue to assume zero latency in waiting for the writer to be free. Not having a deterministic assignment is also generally good to smear out the load. So I don't think we need an async API in Pebble.
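As a side note, the replicas=>workers point can be pictured with a tiny sketch (purely illustrative, not the raftScheduler or writer code): pick a shard non-deterministically per request rather than hashing the range to a fixed shard.

```go
// A tiny sketch of the scheduling idea discussed here: choosing a writer
// shard non-deterministically per request, so a hot range cannot pin a
// single shard. Illustrative only; not the actual scheduler code.
package schedsketch

import "math/rand"

type shard struct{ ch chan func() }

type pool struct{ shards []shard }

// submit hands work to a randomly chosen shard. With enough shards, the
// expected wait for a free shard stays near zero, per the comment above.
func (p *pool) submit(work func()) {
	s := p.shards[rand.Intn(len(p.shards))]
	s.ch <- work
}
```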
… log

This commit replaces the call to `Term(raftAppliedIndex)` with direct use of the new `RaftAppliedIndexTerm` field (added in c3bc064) when generating a `SnapshotMetadata` in service of the `raft.Storage.Snapshot` interface. As of v22.2, this field has been fully migrated in.

First and foremost, this is a code simplification. However, it also helps with projects like cockroachdb#87050, where async Raft log writes make it possible for a Raft leader to apply an entry before it has been appended to the leader's own log. Such flexibility[^1] would help smooth out tail latency in any single replica's local log writes, even if that replica is the leader itself. This is an important characteristic of quorum systems that we fail to provide because of the tight coupling between the Raft leader's own log writes and the Raft leader's acknowledgement of committed proposals.

[^1]: if safe, I haven't convinced myself that it is in all cases. It certainly is not for operations like non-loosely coupled log truncation.
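For readers unfamiliar with the change, a hedged sketch of what the simplification looks like when building the snapshot metadata. The `replicaState` type and wiring here are illustrative; only the `raftpb.SnapshotMetadata` fields come from `etcd/raft`.

```go
// A hedged sketch of the simplification described above: build the
// raftpb.SnapshotMetadata from state the replica already tracks, instead of
// calling Term(appliedIndex) and potentially scanning the Raft log. The
// replicaState type and field names are illustrative, not CockroachDB code.
package snapsketch

import "go.etcd.io/etcd/raft/v3/raftpb"

type replicaState struct {
	RaftAppliedIndex     uint64
	RaftAppliedIndexTerm uint64 // fully migrated in as of v22.2, per the commit
	ConfState            raftpb.ConfState
}

func snapshotMetadata(s replicaState) raftpb.SnapshotMetadata {
	return raftpb.SnapshotMetadata{
		Index:     s.RaftAppliedIndex,
		Term:      s.RaftAppliedIndexTerm, // no Term(appliedIndex) lookup needed
		ConfState: s.ConfState,
	}
}
```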
88419: ui: add connected components for insights r=ericharmeling a=ericharmeling

This commit adds connected components for the workload and schema insights pages, for use in the CC Console. Fixes #87693. https://www.loom.com/share/08163a7c125948119ca71ac097099a29

Release note: None

88510: ui: move `latestQuery`, `latestFormattedQuery` from redux to local state r=xinhaoz a=xinhaoz

Previously, the fields `latestQuery` and `latestFormattedQuery`, representing the latest non-empty query string for a statement viewed from the details page, were being stored in redux. The purpose of these fields was to preserve the query when changing tabs on the stmt details page. Saving this in the redux store was unnecessary, so this commit moves these fields to the stmt details local state.

Release note: None

https://www.loom.com/share/a28d412fb83a429391210935982404de

88594: cloud: add version gate for auth via assume role in AWS and GCP storage and KMS r=rhu713 a=rhu713

Add a version gate for auth via assume role in AWS and GCP storage and KMS to prevent this type of auth until all nodes in the cluster have been upgraded to 22.2. The gate prevents a class of job failures where a job can sometimes succeed with assume role auth if its processors happen to all be on 22.2 nodes, but fail at other times when one of its processor nodes doesn't support assume role. This version gate preempts the issue by preventing this type of auth until the cluster has been finalized on 22.2, and gives a better error message about why the auth cannot be used.

It's important to note that this gate does not prevent a user from creating a BACKUP job that uses assume role auth, e.g. via the DETACHED option, because the destination storage is not accessed during planning. This is in line with existing behavior for other types of auth errors, e.g. if the user enters incorrect credentials. The BACKUP job will still fail with the version gate error when it eventually executes.

Release note: None

88596: kv: use RaftAppliedIndexTerm to generate SnapshotMetadata, don't scan log r=nvanbenschoten a=nvanbenschoten

This commit replaces the call to `Term(raftAppliedIndex)` with direct use of the new `RaftAppliedIndexTerm` field (added in c3bc064) when generating a `SnapshotMetadata` in service of the `raft.Storage.Snapshot` interface. As of v22.2, this field has been fully migrated in.

First and foremost, this is a code simplification. However, it also helps with projects like #87050, where async Raft log writes make it possible for a Raft leader to apply an entry before it has been appended to the leader's own log. Such flexibility[^1] would help smooth out tail latency in any single replica's local log writes, even if that replica is the leader itself. This is an important characteristic of quorum systems that we fail to provide because of the tight coupling between the Raft leader's own log writes and the Raft leader's acknowledgment of committed proposals.

Release justification: None. Don't backport to release-22.2.

Release note: None.

[^1]: if safe, I haven't convinced myself that it is in all cases. It certainly is not for operations like non-loosely coupled log truncation.

88800: kvserver: use type-safe atomics in raftSendQueue r=nvanbenschoten a=pavelkalinnikov

Go 1.19 introduced atomic types that enforce atomic access to variables, which in many situations is less error-prone. This commit resolves a TODO to take advantage of these types.

Release note: None

Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
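For the 88800 change, the before/after pattern of Go 1.19's type-safe atomics looks roughly like this (an illustrative stand-in type, not the actual `raftSendQueue`):

```go
// An illustrative before/after of the Go 1.19 type-safe atomics mentioned in
// 88800. The queue types here are stand-ins, not the actual raftSendQueue.
package atomicsketch

import "sync/atomic"

// Before: a plain int64 that every caller must remember to touch only through
// the atomic.* functions.
type queueOld struct {
	bytes int64
}

func (q *queueOld) add(n int64) int64 { return atomic.AddInt64(&q.bytes, n) }

// After: atomic.Int64 makes accidental non-atomic access impossible to write.
type queueNew struct {
	bytes atomic.Int64
}

func (q *queueNew) add(n int64) int64 { return q.bytes.Add(n) }
```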
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/raftlog/writer.go
line 73 at r1 (raw file):
[Sumeer and I discussed this over lunch]
> This assumption partly relies on the fact that there is no deterministic assignment of replicas=>workers, so we don't have to worry about getting a hotspot because of a poor assignment. If we mirror that approach in these writers, i.e., no deterministic assignment, and enough writers, we can continue to assume zero latency in waiting for the writer to be free.

While this is true across ranges, we do still need to pay for ordering within a range, because we need log writes for a given range to be performed in increasing index order without the possibility of gaps. This is where the desire for an async API comes from. Without an async API, a log write for a given range needs to wait for the sequencing (seq num assignment) and the sync of previous writes for that range. Ideally, it would only need to wait for the sequencing of previous writes and could then independently wait for the sync.

Assuming multiple ranges and a full replication pipeline within each range, the sync Pebble API incurs an average-case 1.5x latency multiplier to order log writes within a range, and then another average-case 1.5x latency multiplier when syncing log writes while waiting for syncs from other ranges, due to the single LogWriter flush+sync loop you mentioned. Combined, that means the average-case latency multiplier to get through this `Write.Append` call is 2.25x.

`shards = 1` helps because it avoids the second multiplier, but only because no one else is syncing in the system and, as a result, the LogWriter flush+sync loop idles when not being synced by this writer. I think one way to get around the absence of an async API is to stop sync-ing individual WriteBatch writes and to lift our own flushing loop up here that calls `Pebble.SyncWAL` in a loop. That would allow us to build our own async API. But I'd like to avoid that. That also only helps if no one else in the system is ever driving the Pebble `LogWriter` sync loop.
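For readers following along, one way to spell out the 2.25x arithmetic under the assumptions above (my reading of the argument, not something verified in this thread):

```latex
% A sketch of the 2.25x arithmetic, assuming a steady stream of syncs and a
% uniformly random arrival point within an in-progress LogWriter flush+sync.
% S denotes the latency of a single flush+sync.
\begin{align*}
  \text{wait out the in-progress flush+sync} &\approx 0.5S \\
  \text{one pass (wait, then own flush+sync)} &\approx 0.5S + S = 1.5S \\
  \text{two passes (ordering, then syncing)}  &\approx 1.5 \times 1.5S = 2.25S
\end{align*}
```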
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/kv/kvserver/raftlog/writer.go
line 73 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I agree that CockroachDB should not need to call something like `Pebble.SyncWAL` in a loop, since it becomes complicated to coordinate what got synced.

I'll get you to explain the 2.25x calculation in front of a whiteboard. I get the same answer but by: 1.5x fsync for doing a fsync'd write to Pebble + 1.5x/2 for waiting for the worker thread that is doing the write to free up = 2.25x. What you are asking for would make this 1x latency and, perhaps more importantly, would remove the 1.5x fsync's worth of IO wait time from the worker thread, which allows for higher throughput on the range (which may allow us to make ranges fatter) and makes the throughput (assuming an open-loop workload) insensitive to variations in fsync latency.
Making the Pebble change should be easy -- do you want it now for this prototype, or later when we are writing the production code?
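Spelled out the same way, the alternative decomposition above reads (again just my reading of the arithmetic, under the same assumptions):

```latex
% Sumeer's decomposition of the same 2.25x figure (S = one flush+sync).
\begin{align*}
  \text{fsync'd write to Pebble}             &\approx 1.5S \\
  \text{wait for the busy worker to free up} &\approx \tfrac{1.5S}{2} = 0.75S \\
  \text{total}                               &\approx 1.5S + 0.75S = 2.25S
\end{align*}
```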
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/kv/kvserver/raftlog/writer.go
line 73 at r1 (raw file):
> I'll get you to explain the 2.25x calculation in front of a whiteboard

👍 I was also not able to really confirm this one. Also, depending on how you model it, I think the actual wait might be longer. When the inflight shard finishes and ours starts, it has no chance to catch a 50%-done Pebble sync. It'll either be the one starting the sync, or it has to latch onto a sync that just started. The average is 0.5, but only if you assume both cases are equally likely. In a system where many ranges are seeing writes, the average is closer to 1.

Anyway, it seems clear that an API that asynchronously confirms inflight fsyncs is the right solution.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/kv/kvserver/raftlog/writer.go
line 73 at r1 (raw file):
> The average is 0.5, but only if you assume both cases are equally likely. In a system where many ranges are seeing writes, the average is closer to 1.
For the case where all writer shards are writing+syncing in a tight loop, I think you are correct that each individual shard ends up participating in every second sync. This leads to an expected wait time of 1 sync, so using multiple shards doubles the latency of any individual shard's call to sync. That's directly in line with my prototype results.
> Making the Pebble change should be easy -- do you want it now for this prototype, or later when we are writing the production code?
If you think the changes should be easy and are willing to make them, then I'd be happy to integrate them into this PR. I've been working on the process of productionizing the changes to etcd/raft, which should be in a good enough state to rebase this PR on shortly.
ApplyNoSyncWait must only be used when WriteOptions.Sync is true. It enqueues the Batch to the WAL, adds to the memtable, and waits until the batch is visible in the memtable, and then returns to the caller. The caller is responsible for calling Batch.SyncWait to wait until the write to the WAL is fsynced.

This change required splitting the WaitGroup in the Batch into two WaitGroups, so waiting for the visibility can happen separately from waiting for the WAL write. Additionally, the channel used as a semaphore for reserving space in the two lock-free queues is split into two channels, since dequeueing from these queues can happen in arbitrary order. There may be some performance overhead of pushing and popping from two channels instead of one.

Informs cockroachdb/cockroach#17500

See discussion thread cockroachdb/cockroach#87050 (review)
ApplyNoSyncWait must only be used when WriteOptions.Sync is true. It enqueues the Batch to the WAL, adds to the memtable, and waits until the batch is visible in the memtable, and then returns to the caller. The caller is responsible for calling Batch.SyncWait to wait until the write to the WAL is fsynced.

This change required splitting the WaitGroup in the Batch into two WaitGroups, so waiting for the visibility can happen separately from waiting for the WAL write. Additionally, the channel used as a semaphore for reserving space in the two lock-free queues is split into two channels, since dequeueing from these queues can happen in arbitrary order. Benchmarks indicate that the overhead of pushing and popping on an extra channel is tolerable.

Benchmarks were run on a macbook pro -- note these are not doing an actual sync since they use io.Discard, and are only benchmarking the commit pipeline.

Sync wait on master (old) vs this branch (new):

```
name                                             old time/op    new time/op    delta
CommitPipeline/no-sync-wait=false/parallel=1-10    1.09µs ± 6%    1.15µs ± 9%     ~     (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10    1.53µs ± 4%    1.54µs ± 2%     ~     (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10    1.54µs ± 1%    1.59µs ± 1%   +2.87%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10    1.52µs ± 1%    1.55µs ± 1%   +2.43%  (p=0.008 n=5+5)

name                                             old speed      new speed      delta
CommitPipeline/no-sync-wait=false/parallel=1-10  14.7MB/s ± 5%  13.9MB/s ±10%     ~     (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10  10.5MB/s ± 4%  10.4MB/s ± 2%     ~     (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10  10.4MB/s ± 1%  10.1MB/s ± 1%   -2.78%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10  10.5MB/s ± 1%  10.3MB/s ± 1%   -2.35%  (p=0.008 n=5+5)

name                                             old alloc/op   new alloc/op   delta
CommitPipeline/no-sync-wait=false/parallel=1-10    1.37kB ± 0%    1.40kB ± 0%   +2.31%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10    1.37kB ± 0%    1.40kB ± 0%   +2.31%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10    1.37kB ± 0%    1.40kB ± 0%   +2.15%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10    1.37kB ± 0%    1.40kB ± 0%   +2.34%  (p=0.008 n=5+5)

name                                             old allocs/op  new allocs/op  delta
CommitPipeline/no-sync-wait=false/parallel=1-10      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=2-10      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=4-10      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=8-10      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
```

Sync wait on this branch (old) vs async wait on this branch (new):

```
name                          old time/op    new time/op    delta
CommitPipeline/parallel=1-10    1.15µs ± 9%    1.20µs ± 7%     ~     (p=0.421 n=5+5)
CommitPipeline/parallel=2-10    1.54µs ± 2%    1.59µs ± 2%   +3.50%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10    1.59µs ± 1%    1.58µs ± 1%     ~     (p=0.802 n=5+5)
CommitPipeline/parallel=8-10    1.55µs ± 1%    1.56µs ± 1%     ~     (p=0.452 n=5+5)

name                          old speed      new speed      delta
CommitPipeline/parallel=1-10  13.9MB/s ±10%  13.3MB/s ± 7%     ~     (p=0.421 n=5+5)
CommitPipeline/parallel=2-10  10.4MB/s ± 2%  10.1MB/s ± 2%   -3.36%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10  10.1MB/s ± 1%  10.1MB/s ± 1%     ~     (p=0.786 n=5+5)
CommitPipeline/parallel=8-10  10.3MB/s ± 1%  10.3MB/s ± 1%     ~     (p=0.452 n=5+5)

name                          old alloc/op   new alloc/op   delta
CommitPipeline/parallel=1-10    1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.651 n=5+5)
CommitPipeline/parallel=2-10    1.40kB ± 0%    1.39kB ± 0%   -0.21%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10    1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.706 n=5+5)
CommitPipeline/parallel=8-10    1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.587 n=5+5)

name                          old allocs/op  new allocs/op  delta
CommitPipeline/parallel=1-10      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=2-10      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=4-10      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=8-10      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
```

Informs cockroachdb/cockroach#17500

See discussion thread cockroachdb/cockroach#87050 (review)
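Assuming the method names described in this commit message (`ApplyNoSyncWait`, `Batch.SyncWait`), here is a caller-side sketch of how the split API could be used to decouple sequencing from waiting on the fsync. The surrounding wiring is hypothetical, not the actual CockroachDB integration.

```go
// A caller-side sketch of the split API described above. Method names follow
// the commit message (ApplyNoSyncWait, SyncWait); the wiring is hypothetical.
package applysketch

import "github.com/cockroachdb/pebble"

// appendAndNotify sequences a Raft log write immediately, then waits for WAL
// durability separately, so later writes are not blocked behind this fsync.
func appendAndNotify(db *pebble.DB, batch *pebble.Batch, onDurable func()) error {
	// Enqueue to the WAL and apply to the memtable, but do not wait for fsync.
	if err := db.ApplyNoSyncWait(batch, pebble.Sync); err != nil {
		return err
	}
	// Wait for durability asynchronously and only then notify the replica.
	go func() {
		if err := batch.SyncWait(); err == nil {
			onDurable()
		}
		batch.Close()
	}()
	return nil
}
```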
Replaced by #94165.